In this lecture we will consider other data types such as factors, lists and data frames, as well as graphics.
Factors are R's data type for categorical variables. What are categorical variables?
# Create a blood group vector
blood_group_vector <- c("AB", "O", "B+", "AB-", "O", "AB", "A", "A", "B", "AB-")
blood_group_vector
## [1] "AB" "O" "B+" "AB-" "O" "AB" "A" "A" "B" "AB-"
# Create factors from the vector
blood_group_factor <- factor(blood_group_vector)
blood_group_factor
## [1] AB O B+ AB- O AB A A B AB-
## Levels: A AB AB- B B+ O
Note:
R encodes factors as integers for efficient storage and computation. By default, the levels are sorted alphabetically and numbered in that order. For example, A is assigned 1, AB is assigned 2, etc. This can be viewed by invoking the str() function:
# Show the structure of the blood group factor
str(blood_group_factor)
## Factor w/ 6 levels "A","AB","AB-",..: 2 6 5 3 6 2 1 1 4 3
print(blood_group_vector)
paste(as.character(as.integer(blood_group_factor)), " ")
## [1] "AB" "O" "B+" "AB-" "O" "AB" "A" "A" "B" "AB-"
## [1] "2 " "6 " "5 " "3 " "6 " "2 " "1 " "1 " "4 " "3 "
This default order can be overridden by specifying the levels argument of the factor() function.
# Define another set of levels, overriding the default
blood_group_factor2 <- factor(blood_group_vector, levels = c("A", "B", "B+", "AB", "AB-", "O"))
print(blood_group_factor2)
## [1] AB O B+ AB- O AB A A B AB-
## Levels: A B B+ AB AB- O
str(blood_group_factor2)
## Factor w/ 6 levels "A","B","B+","AB",..: 4 6 3 5 6 4 1 1 2 5
# Comparing the default alphabetic order with the new one:
as.integer(blood_group_factor)
## [1] 2 6 5 3 6 2 1 1 4 3
as.integer(blood_group_factor2)
## [1] 4 6 3 5 6 4 1 1 2 5
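To make the integer encoding concrete, here is a minimal sketch on a small made-up vector (the names x, f, codes and recovered are only for illustration). It shows that the levels act as a lookup table and the integer codes are positions in that table:

```r
# Illustrative vector (not the blood group vector used above)
x <- c("AB", "O", "A", "O")
f <- factor(x)

# The levels are the lookup table, sorted alphabetically by default
levels(f)

# The integer codes are positions in that table
codes <- as.integer(f)
codes

# Indexing the levels by the codes recovers the original strings
recovered <- levels(f)[codes]
identical(recovered, x)
```

This is also why changing the levels changes the integer codes while the printed values stay the same.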
Renaming factors can be done using the levels() function.
# Define blood type
blood_type <- c("B", "A", "AB", "A", "O")
# Find the factors
blood_type_factor <- factor(blood_type)
blood_type_factor
## [1] B A AB A O
## Levels: A AB B O
# Rename the factors
levels(blood_type_factor) <- c("BT_A", "BT_AB", "BT_B", "BT_O")
blood_type_factor
## [1] BT_B BT_A BT_AB BT_A BT_O
## Levels: BT_A BT_AB BT_B BT_O
Note: It is extremely important to follow the same order as the default order supplied by R. Otherwise, the result can be extremely confusing as the following exercise will show.
Classwork/Homework: Rename the blood_type_factor in the above example as follows:
levels(blood_type_factor) <- c("BT_A", "BT_B", "BT_AB", "BT_O")
and justify the result of outputting blood_type_factor. Use str() to support your answer.
If you want to safely rename your levels or change their default order, it is always best to define the labels along with the levels, starting from the original character vector, like this:
factor(blood_type, levels=c("A", "B", "AB", "O"),
labels=c("BT_A", "BT_B", "BT_AB", "BT_O"))
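One way to check that levels and labels are paired as intended is a cross-tabulation of the original values against the renamed factor. The following is a small sketch; the names renamed and mapping are introduced here only for illustration:

```r
blood_type <- c("B", "A", "AB", "A", "O")

# levels gives the values to look for, labels gives their new names,
# paired by position
renamed <- factor(blood_type,
                  levels = c("A", "B", "AB", "O"),
                  labels = c("BT_A", "BT_B", "BT_AB", "BT_O"))

# Each original value should fall into exactly one new label
mapping <- table(original = blood_type, renamed = renamed)
mapping

as.character(renamed)
```

If a row of the table has counts in an unexpected column, the labels were supplied in the wrong order.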
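An easy and fast way to generate a simple factor with a given number of repetitions is the gl() function. In gl(n, k, length), n is the number of levels, k is the number of consecutive repetitions of each level, and length is the total length of the result.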
factorZ <- gl(3, 2, length = 12)
print(factorZ)
## [1] 1 1 2 2 3 3 1 1 2 2 3 3
## Levels: 1 2 3
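As a further sketch of gl(), the labels argument can replace the default level names 1, 2, .... The labels "Control" and "Treated" below are made up for illustration:

```r
# Two levels, each repeated three times in a row
treatment <- gl(2, 3, labels = c("Control", "Treated"))
treatment

# With `length`, the pattern is recycled until that many values exist
alternating <- gl(2, 1, length = 5)
alternating
```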
Nominal factors: These are categorical variables that cannot be ordered, like blood group. For example, it doesn’t make sense to say blood group A < blood group B.
Ordinal factors: These are categorical variables that can be ordered. For instance, tumor sizes. We can say T1 (tumor size 2cm or smaller) < T2 (tumor size larger than 2cm but smaller than 5 cm).
R provides us with a way to impose order on factors. Simply use the argument ordered=TRUE inside the factor function.
# Specify the tumor size vector
tumor_size <- c("T1", "T1", "T2", "T3", "T1")
# Set the order by specifying "ordered=TRUE"
tumor_size_factor <- factor(tumor_size, ordered = TRUE,
levels=c("T1", "T2", "T3"))
# Print the results
tumor_size_factor
## [1] T1 T1 T2 T3 T1
## Levels: T1 < T2 < T3
# Compare one factor vs the other
tumor_size_factor[1] < tumor_size_factor[2]
## [1] FALSE
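Once a factor is ordered, comparisons against a level and order-based summaries become available. A minimal sketch, reusing the tumor size example (the names at_least_T2 and largest are only for illustration):

```r
tumor_size <- c("T1", "T1", "T2", "T3", "T1")
tumor_size_factor <- factor(tumor_size, ordered = TRUE,
                            levels = c("T1", "T2", "T3"))

# Compare every element against a level given as a string
at_least_T2 <- tumor_size_factor >= "T2"
at_least_T2

# Summaries that need an ordering are defined for ordered factors
largest <- max(tumor_size_factor)
as.character(largest)
```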
Classwork/Homework: Use the inequality operator (< or >) on a nominal category and print the output.
Recall vectors and matrices can hold only one data type, like integer or character. Lists can hold multiple R objects, without having to perform coercion.
# Defining different data type as vector (Note, coercion takes place)
vec <- c("Blood-sugar", "High", 140, "mg/dL")
vec
## [1] "Blood-sugar" "High" "140" "mg/dL"
# And as a list
lst <- list("Blood-sugar", "High", 140, "mg/dL")
# One can use the is.list() function to see if something is a list
is.list(lst)
## [1] TRUE
lst
## [[1]]
## [1] "Blood-sugar"
##
## [[2]]
## [1] "High"
##
## [[3]]
## [1] 140
##
## [[4]]
## [1] "mg/dL"
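The effect of coercion can be checked by asking for the class of the vector and of each list element. A short sketch (element_types is a name introduced here):

```r
vec <- c("Blood-sugar", "High", 140, "mg/dL")
lst <- list("Blood-sugar", "High", 140, "mg/dL")

# A vector has one type for all of its elements
class(vec)

# A list keeps the type of every element
element_types <- sapply(lst, class)
element_types

# The number in the list is still usable for arithmetic
lst[[3]] + 10
```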
Naming a list can be done through the names() function or specifying it in the list itself.
# Define a list
lst <- list("Blood sugar", "High", 140, "mg/dL")
# Assign names and print
names(lst) <- c("Fluid", "Category", "Value", "Units")
print(lst)
## $Fluid
## [1] "Blood sugar"
##
## $Category
## [1] "High"
##
## $Value
## [1] 140
##
## $Units
## [1] "mg/dL"
Or specify names directly while defining the list
# Specify while constructing the list
blood_test <- list(Fluid="Blood sugar", Category="High", Value=140, Units="mg/dL")
# For compact display use the str() function
str(blood_test)
## List of 4
## $ Fluid : chr "Blood sugar"
## $ Category: chr "High"
## $ Value : num 140
## $ Units : chr "mg/dL"
Note: A list can contain another list, or any number of nested lists.
The difference between [] and [[]] is that [] returns a list, whereas [[]] returns the element stored in the list.
# Creating a list of patient's details containing the 'blood_test' list
patient <- list(Name="Mike", Age=36, Btest = blood_test)
# Show the first element of the list
patient[1]
## $Name
## [1] "Mike"
class(patient[1])
## [1] "list"
# Access the content of the first element
patient[[1]]
## [1] "Mike"
class(patient[[1]])
## [1] "character"
# Show the structure of the third element of the list
str(patient[3])
## List of 1
## $ Btest:List of 4
## ..$ Fluid : chr "Blood sugar"
## ..$ Category: chr "High"
## ..$ Value : num 140
## ..$ Units : chr "mg/dL"
# Show the structure of the content of the third element (which in this case is a list by itself)
str(patient[[3]])
## List of 4
## $ Fluid : chr "Blood sugar"
## $ Category: chr "High"
## $ Value : num 140
## $ Units : chr "mg/dL"
Classwork/Homework:
What does patient[c(1,3)] give us? Is it a list or elements?
What does patient[[c(1,3)]] give us? Is it a list or elements?
What about patient[[c(3,1)]]? What is the difference?
(Hint: patient[[c(1,3)]] is the same as patient[[1]][[3]].)
Subsetting by names is super easy: just supply the name within brackets. For example, patient["Name"] or patient[["Name"]].
Subsetting by logicals works only with single brackets, which return a sub-list. For instance, patient[c(TRUE,FALSE)]; the logical vector is recycled along the list.
It doesn’t make sense to extract a single element through logicals, i.e., patient[[c(TRUE,FALSE)]].
Another cool way to access elements (just the same as using [[]]) is the $ sign.
However, to do this, the list should be named. For example, patient$Name will print the patient name.
class(patient$Name)
## [1] "character"
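One practical difference between $ and [[]] shows up when the element name is stored in a variable. The sketch below uses a shortened patient list and a hypothetical variable called field:

```r
patient <- list(Name = "Mike", Age = 36)

# The element name is held in a variable
field <- "Age"

# [[ ]] evaluates `field` and looks up "Age"
patient[[field]]

# $ takes the name literally and looks for an element called "field"
patient$field
```

Since the list has no element called "field", the last line returns NULL rather than the age.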
$ sign can also be used for extending lists:
# Extend the list to include gender
patient$Gender <- "Male"
# This is same as using double brackets
patient[["Gender"]] <- "Male"
# Extend the blood test list to include the date of the test
patient$Btest$Date <- "Sept.14"
Classwork/Homework: How do you remove an element from a list?
Datasets come with various shapes and sizes. Usually they constitute:
Limitations of other data types:
Data frames can hold values of different types within one observation (row); however, all values of a variable (column) must have the same data type.
Usually data frames are imported - through CSV, or Excel etc. However, we can create a data frame as well.
# Create name, age and logical vectors
name <- c("Anne", "James", "Mike", "Betty")
age <- c(20, 43, 27, 25)
cancer <- c(TRUE, FALSE, FALSE, TRUE)
# Form a data frame
df <- data.frame(name, age, cancer)
df
## name age cancer
## 1 Anne 20 TRUE
## 2 James 43 FALSE
## 3 Mike 27 FALSE
## 4 Betty 25 TRUE
Update the names attribute
# (the same way as we did for vectors)
names(df) <- c("Name", "Age", "Cancer_Status")
attributes(df)
## $names
## [1] "Name" "Age" "Cancer_Status"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4
# Or specify directly while creating the data frame
df <- data.frame(Name=name, Age=age, Cancer_Status=cancer)
df
## Name Age Cancer_Status
## 1 Anne 20 TRUE
## 2 James 43 FALSE
## 3 Mike 27 FALSE
## 4 Betty 25 TRUE
Classwork/Homework:
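Note: In R versions before 4.0.0, data.frame() stored character vectors as factors by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, so character columns stay character. You can set the behavior explicitly as follows: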
df <- data.frame(Name=name, Age=age, Cancer_Status=cancer,
stringsAsFactors = FALSE)
print(df)
## Name Age Cancer_Status
## 1 Anne 20 TRUE
## 2 James 43 FALSE
## 3 Mike 27 FALSE
## 4 Betty 25 TRUE
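The printed data frame looks the same either way, so the column classes are the place to check. A minimal sketch with a shortened version of the vectors above (df_chr and df_fct are names introduced here):

```r
name <- c("Anne", "James")
age <- c(20, 43)

df_chr <- data.frame(Name = name, Age = age, stringsAsFactors = FALSE)
df_fct <- data.frame(Name = name, Age = age, stringsAsFactors = TRUE)

# Class of every column: Name is character in one, factor in the other
sapply(df_chr, class)
sapply(df_fct, class)
```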
We can subset by indices:
# Subsetting by indices - works just like matrices
df[1,2]
## [1] 20
# Get the entire row/column - just like matrices
# Get the second row
df[2,]
## Name Age Cancer_Status
## 2 James 43 FALSE
We can also subset using the names as well as indices:
# Get the "cancer" column
df[,"Cancer_Status"]
## [1] TRUE FALSE FALSE TRUE
# One can use column names as well
df[1, "Age"]
## [1] 20
# Get the 2nd and 3rd observations with their "Name" and "Cancer_Status" columns
df[c(2,3), c("Name", "Cancer_Status")]
## Name Cancer_Status
## 2 James FALSE
## 3 Mike FALSE
The main difference in subsetting a data.frame versus a matrix is when you specify a single number as the index within []. For a matrix you get the element at that linear index, but for a data frame you get the column at that index, returned as a one-column data frame.
An example:
# Print the class (of the values) of the second column
class(df[,2])
## [1] "numeric"
# Class of the retrieved element, using a single bracket
class(df[2])
## [1] "data.frame"
This is because a data frame is a list of vectors of equal length. Thus, single [2] returns a sub-list holding the second element of that list (the vector of ages), and that sub-list is itself a data frame. Use [[2]] or [,2] to get the vector itself.
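The different ways of picking out a column can be compared side by side. This is a small sketch on a two-row data frame; drop = FALSE is a standard argument of [ for data frames:

```r
df <- data.frame(Name = c("Anne", "James"), Age = c(20, 43))

class(df[2])                  # single bracket: a one-column data frame
class(df[[2]])                # double bracket: the column vector itself
class(df[, 2])                # matrix-style index: also the vector
class(df[, 2, drop = FALSE])  # drop = FALSE keeps the data frame
```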
Classwork/Homework: Test the operations of lists (like ["Age"] & [["Age"]]) on data frames.
Adding a column is super easy. Since data frames are a list of vectors one can just append a vector to the list.
For instance, if we have a column of tumor size info like this for each patient: c("T3","T0","T0","T2"), the following code will append the vector.
# Append tumor size vector
df$Tumor_size <- c("T3", "T0", "T0", "T2")
df
## Name Age Cancer_Status Tumor_size
## 1 Anne 20 TRUE T3
## 2 James 43 FALSE T0
## 3 Mike 27 FALSE T0
## 4 Betty 25 TRUE T2
Classwork/Homework:
Use cbind() to append a vector of choice.
In contrast, adding a row (or observation) is not straightforward. This is because an observation may contain different data types. To add observations, make a new data frame and append it:
# Create a data frame (pay attention to the capital letters in the variable names!)
# Cancer_Status is the logical TRUE, not the string "TRUE", so the column stays logical
tom <- data.frame(Name="Tom", Age=47, Cancer_Status=TRUE, Tumor_size="T2")
# And append
df <- rbind(df, tom)
df
## Name Age Cancer_Status Tumor_size
## 1 Anne 20 TRUE T3
## 2 James 43 FALSE T0
## 3 Mike 27 FALSE T0
## 4 Betty 25 TRUE T2
## 5 Tom 47 TRUE T2
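Column types deserve a check after rbind(), because the printed output does not show them. The sketch below uses a reduced data frame; ok and mixed are names introduced for illustration:

```r
df <- data.frame(Name = c("Anne", "James"), Age = c(20, 43),
                 Cancer_Status = c(TRUE, FALSE))

# A logical TRUE keeps the column logical after rbind()
ok <- rbind(df, data.frame(Name = "Tom", Age = 47, Cancer_Status = TRUE))
class(ok$Cancer_Status)

# A quoted "TRUE" is a string; after rbind() the column is no longer logical
mixed <- rbind(df, data.frame(Name = "Tom", Age = 47, Cancer_Status = "TRUE"))
class(mixed$Cancer_Status)
```

Both results print identically, which is why str() or class() is the safer way to verify the outcome.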
Classwork/Homework:
What happens if you use the list() function instead of the data frame function in the above code?
What happens with name=, age= etc. in the above code?
Look up expand.grid(): what is it for?
expand.grid(height = as.character(seq(60, 70, 5)), weight = seq(100, 200, 50),
sex = c("Male","Female"), stringsAsFactors = FALSE)
We can use the order() function to sort the entire data frame with respect to a particular column.
# Get the row order that sorts a column, say "Age"
ranks <- order(df$Age)
# `ranks` is a vector of row indexes (a sorting permutation), not statistical ranks
print(ranks)
## [1] 1 4 3 2 5
# Reorder the rows of the data frame with this index vector
df[ranks,]
df[ranks,]
## Name Age Cancer_Status Tumor_size
## 1 Anne 20 TRUE T3
## 4 Betty 25 TRUE T2
## 3 Mike 27 FALSE T0
## 2 James 43 FALSE T0
## 5 Tom 47 TRUE T2
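order() also accepts a decreasing argument and several sort keys. A minimal sketch on the four original patients (oldest_first and by_status_age are names introduced here):

```r
df <- data.frame(Name = c("Anne", "James", "Mike", "Betty"),
                 Age = c(20, 43, 27, 25),
                 Cancer_Status = c(TRUE, FALSE, FALSE, TRUE))

# Oldest first
oldest_first <- df[order(df$Age, decreasing = TRUE), ]
oldest_first$Name

# Sort by Cancer_Status first (FALSE before TRUE), then by Age within each status
by_status_age <- df[order(df$Cancer_Status, df$Age), ]
by_status_age$Name
```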
Classwork/Homework:
Does sort(df$age) return an error?
In this part of the lecture we will learn about graphics in R.
R has very strong graphical capabilities, which is a major reason for its popularity in both industry and academia.
Packages are extensions of R functionality, adding a new set of functions that are tailored to handle tasks with a common purpose. When a new package is loaded, it often happens that some of the new functions have the same name as currently loaded ones. When this happens, we get a “conflict”: R warns you and lists the functions whose earlier definitions are now masked.
Packages are loaded using the library() function.
When a package is being loaded, it can load other packages that it depends on.
Sometimes we want to use a function from a specific package just once. In this case, instead of attaching the whole package with library(), we can refer to the function directly using the :: operator. For example: Hmisc::cut2()
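As a small sketch of the :: operator, the calls below use packages that ship with R, so nothing needs to be installed. The variable has_hmisc is introduced here only to show how one might check for an optional package:

```r
# Call functions through their namespace without library()
stats::median(c(3, 1, 2))
utils::head(letters, 3)

# requireNamespace() reports whether a package is installed
# without attaching it to the search path
has_hmisc <- requireNamespace("Hmisc", quietly = TRUE)
has_hmisc
```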
The graphics package is part of the default list of packages loaded when you start R. It has many functions; plot() and hist() provide the essential functionality.
The plot() function is generic, which means:
Before we see how the plot function works, we will first import a public health data set. We will work with the HANES data set, which is New York City’s Health and Nutrition survey data set.
# If needed, install RCurl package, then load the package
# install.packages("RCurl")
library(RCurl)
## Loading required package: bitops
# Import the HANES data set from GitHub; break the string into two for readability
# (Please note this readability aspect very carefully)
URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
URL_text_2 <- "/master/Datasets/HANES/NYC_HANES_DIAB.csv"
# Paste it to constitute a single URL
URL <- paste(URL_text_1, URL_text_2, sep="")
HANES <- read.csv(text = getURL(URL))
We now observe the structure of the data.
# Observe the structure
str(HANES)
## 'data.frame': 1527 obs. of 23 variables:
## $ KEY : Factor w/ 1527 levels "133370A","133370B",..: 28 32 43 44 53 55 70 84 90 100 ...
## $ GENDER : int 1 1 1 1 1 1 1 1 1 1 ...
## $ SPAGE : int 29 27 28 27 24 30 26 31 32 34 ...
## $ AGEGROUP : int 1 1 1 1 1 1 1 1 1 1 ...
## $ HSQ_1 : int 2 2 2 2 1 1 3 1 2 1 ...
## $ UCREATININE : int 105 296 53 314 105 163 150 46 36 177 ...
## $ UALBUMIN : num 0.707 18 1 8 4 3 2 2 0.707 4 ...
## $ UACR : num 0.00673 6 2 3 4 ...
## $ MERCURYU : num 0.37 NA 0.106 0.487 2.205 ...
## $ DX_DBTS : int 3 3 3 3 3 3 3 3 3 3 ...
## $ A1C : num 5 5.5 5.2 4.8 5.1 4.3 5.2 4.8 5.2 4.8 ...
## $ CADMIUM : num 0.2412 0.4336 0.1732 0.0644 0.0929 ...
## $ LEAD : num 1.454 0.694 1.019 0.863 1.243 ...
## $ MERCURYTOTALBLOOD: num 2.34 3.11 2.57 1.32 14.66 ...
## $ HDL : int 42 52 51 42 61 52 50 57 56 42 ...
## $ CHOLESTEROLTOTAL : int 184 117 157 145 206 120 155 156 235 156 ...
## $ GLUCOSESI : num 4.61 4.5 4.77 5.16 5 ...
## $ CREATININESI : num 74.3 80 73 80 84.9 ...
## $ CREATININE : num 0.84 0.91 0.83 0.91 0.96 0.75 0.99 0.9 0.84 0.93 ...
## $ TRIGLYCERIDE : int 156 63 43 108 65 51 29 31 220 82 ...
## $ GLUCOSE : int 83 81 86 93 90 92 85 72 87 96 ...
## $ COTININE : num 31.5918 57.6882 0.0635 0.035 0.0514 ...
## $ LDLESTIMATE : int 111 52 97 81 132 58 99 93 135 98 ...
Note that GENDER, AGEGROUP and HSQ_1 are integers but in fact they should be factors! So, we need to convert them to factors.
# Convert them to factors
HANES$GENDER <- as.factor(HANES$GENDER)
HANES$AGEGROUP <- as.factor(HANES$AGEGROUP)
HANES$HSQ_1 <- as.factor(HANES$HSQ_1)
# Now observe the structure
str(HANES)
## 'data.frame': 1527 obs. of 23 variables:
## $ KEY : Factor w/ 1527 levels "133370A","133370B",..: 28 32 43 44 53 55 70 84 90 100 ...
## $ GENDER : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ SPAGE : int 29 27 28 27 24 30 26 31 32 34 ...
## $ AGEGROUP : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ HSQ_1 : Factor w/ 5 levels "1","2","3","4",..: 2 2 2 2 1 1 3 1 2 1 ...
## $ UCREATININE : int 105 296 53 314 105 163 150 46 36 177 ...
## $ UALBUMIN : num 0.707 18 1 8 4 3 2 2 0.707 4 ...
## $ UACR : num 0.00673 6 2 3 4 ...
## $ MERCURYU : num 0.37 NA 0.106 0.487 2.205 ...
## $ DX_DBTS : int 3 3 3 3 3 3 3 3 3 3 ...
## $ A1C : num 5 5.5 5.2 4.8 5.1 4.3 5.2 4.8 5.2 4.8 ...
## $ CADMIUM : num 0.2412 0.4336 0.1732 0.0644 0.0929 ...
## $ LEAD : num 1.454 0.694 1.019 0.863 1.243 ...
## $ MERCURYTOTALBLOOD: num 2.34 3.11 2.57 1.32 14.66 ...
## $ HDL : int 42 52 51 42 61 52 50 57 56 42 ...
## $ CHOLESTEROLTOTAL : int 184 117 157 145 206 120 155 156 235 156 ...
## $ GLUCOSESI : num 4.61 4.5 4.77 5.16 5 ...
## $ CREATININESI : num 74.3 80 73 80 84.9 ...
## $ CREATININE : num 0.84 0.91 0.83 0.91 0.96 0.75 0.99 0.9 0.84 0.93 ...
## $ TRIGLYCERIDE : int 156 63 43 108 65 51 29 31 220 82 ...
## $ GLUCOSE : int 83 81 86 93 90 92 85 72 87 96 ...
## $ COTININE : num 31.5918 57.6882 0.0635 0.035 0.0514 ...
## $ LDLESTIMATE : int 111 52 97 81 132 58 99 93 135 98 ...
Let’s plot a categorical variable, for instance gender.
# Plot the factor gender
plot(HANES$GENDER)
Classwork/Homework:
Let’s now plot a numerical variable.
# Plot a numerical variable
plot(HANES$A1C)
Of course, we can plot two numerical variables:
# Plot two numerical variables
# A1C - Hemoglobin percentage, UACR - Urine Albumin/Creatinine Ratio
plot(HANES$A1C, HANES$UACR)
Note that R automatically renders them as a scatter plot and sets the axis scales based on the range of the variables:
min(HANES$A1C, na.rm = T); max(HANES$A1C, na.rm = T)
## [1] 3
## [1] 13.4
min(HANES$UACR, na.rm = T); max(HANES$UACR, na.rm = T)
## [1] 0.002412969
## [1] 5327
For the purpose of learning R Markdown, have a look at the output of the same code as above; this time the chunk option results='hold' was used, so all output is printed after the code:
min(HANES$A1C, na.rm = T); max(HANES$A1C, na.rm = T)
min(HANES$UACR, na.rm = T); max(HANES$UACR, na.rm = T)
## [1] 3
## [1] 13.4
## [1] 0.002412969
## [1] 5327
However, this plot is uninformative because a few very large UACR values stretch the vertical axis and squeeze most points against the bottom. One can restrict the axis range using the “ylim” argument:
# Plot two numerical variables with appropriate scaling
plot(HANES$A1C, HANES$UACR, ylim=c(0, 10))
Although the scaling is okay now, the relationship is extremely complicated.
One of the transformations that helps us to understand relationships between the variables is the log() function.
We can apply the logarithm to both variables:
# Transform the data using the log function and plot the result
plot(log(HANES$A1C), log(HANES$UACR))
We note that there are two different clusters of patients - one with low UACR values and another with high UACR values, both corresponding to a mean \(log(A1C)\) of about \(1.7\).
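To see why the log transform helps with such data, here is a minimal sketch on made-up, strongly right-skewed values (these are not HANES values):

```r
# Hypothetical values spanning several orders of magnitude
x <- c(0.5, 1, 2, 5, 20, 100, 5000)

# On the original scale one large value dominates the range
range(x)
diff(range(x))

# On the log scale the values are spread far more evenly
range(log(x))
diff(range(log(x)))
```

plot() also accepts log = "y" (or "x", "xy") to draw a logarithmic axis while keeping the original units on the tick labels.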
We can also plot two categorical variables. Let us plot GENDER and AGEGROUP factors.
Let’s change the level labels to something more informative (based on the HANES codebook):
# Rename the GENDER factor for identification
HANES$GENDER <- factor(HANES$GENDER, labels=c("M","F"))
# Rename the AGEGROUP factor for identification
HANES$AGEGROUP <- factor(HANES$AGEGROUP, labels=c("20-39","40-59","60+"))
# Plot GENDER vs AGEGROUP
plot(HANES$GENDER, HANES$AGEGROUP)
Note that R displays the proportions in the plot. The first argument goes on the \(x\)-axis and the second on the \(y\)-axis.
Now, let’s switch the order:
# Swap AGEGROUP vs GENDER
plot(HANES$AGEGROUP, HANES$GENDER)
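The proportions shown in such a plot can also be computed as numbers with table() and prop.table(). A minimal sketch on made-up factors (gender and age_group here are toy vectors, not the HANES columns):

```r
gender <- factor(c("M", "F", "F", "M", "F"))
age_group <- factor(c("20-39", "20-39", "40-59", "60+", "60+"))

# Cross-tabulate the two factors
counts <- table(gender, age_group)
counts

# margin = 1 gives proportions within each row, i.e. within each gender
props <- prop.table(counts, margin = 1)
props
```

Swapping the two factors in table() corresponds to swapping the arguments of plot().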
Next, let’s explore the hist() function. hist() is a short form for histogram.
The hist() function:
Here is an example to find the distribution of A1C variable for the male population.
First select only the male population:
# Form a logical vector that is TRUE for the male records
HANES_MALE <- HANES$GENDER == "M"
# Select only the records for the male population
MALES_DF <- HANES[HANES_MALE,]
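The same selection can be written with subset(), which evaluates the condition inside the data frame. The sketch below uses a small made-up data frame called toy as a stand-in for HANES:

```r
# Made-up stand-in for HANES with the two relevant columns
toy <- data.frame(GENDER = factor(c("M", "F", "M", "F")),
                  A1C = c(5.1, NA, 6.3, 5.6))

# Logical-vector subsetting, as above
is_male <- toy$GENDER == "M"
males_a <- toy[is_male, ]

# subset() lets us write the column name without the `toy$` prefix
males_b <- subset(toy, GENDER == "M")

all.equal(males_a, males_b)
```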
Now, let’s make a histogram for the male population selected above:
# Make a histogram
hist(MALES_DF$A1C)
Observe that the glycohemoglobin percentage lies between \(5\) and \(6\) for most of the men (the mode).
Note that R has also chosen the number of bins automatically.
You can increase (or decrease) the number of bins using the “breaks” argument.
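As a sketch of the “breaks” argument, the code below uses made-up A1C-like values and plot = FALSE so that only the computed bins are returned. Note that a single number passed to breaks is treated as a suggestion, whereas a vector of break points is used exactly:

```r
# Made-up values, not HANES data
a1c <- c(4.8, 5.0, 5.1, 5.2, 5.2, 5.5, 5.9, 6.4, 7.1, 9.3)

# Default binning chosen by R
h_default <- hist(a1c, plot = FALSE)
h_default$breaks

# Explicit break points: bins (4,6], (6,8] and (8,10]
h_custom <- hist(a1c, breaks = seq(4, 10, by = 2), plot = FALSE)
h_custom$breaks
h_custom$counts
```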
There are other cool tools like barplot(), boxplot(), pairs() in the graphics package.
The plot system allows adding plot elements one on top of the other.
For example, on top of the histogram, let’s add a vertical line that represents the mean of the distribution:
# Make a histogram
hist(MALES_DF$A1C)
# Add a vertical line, supplying the x-axis value
abline(v = mean(MALES_DF$A1C, na.rm = T), col="red")
Classwork/Homework:
How does this plot look?
# Plot LDL values vs HDL values
plot(HANES$LDL, HANES$HDL)
compared to this:
# Plot GLUCOSE vs GLUCOSESI with parameters
plot(HANES$GLUCOSE, HANES$GLUCOSESI,
     xlab = "Plasma Glucose [mg/dL]",
     # GLUCOSESI is in mmol/L: values near 4.6 correspond to about 83 mg/dL
     ylab = "Blood Glucose SI units [mmol/L]",
     main = "Plasma vs Blood Glucose", type = "o", col = "blue")
Classwork/Homework: Check the Hmisc::label() function. Given the graph above, how can one use this function to save some typing when plotting several graphs with the same variable? Give an example.
To do good data science, it certainly helps to know the correlations between the variables (in the above figure, GLUCOSE and GLUCOSESI hold the same measurement in different units, so the points fall essentially on a straight line), but how we present the data also matters!
Some plot function characteristics:
xlab: Horizontal axis label
ylab: Vertical axis label
main: Plot title
type: Plot type
col: Plot color
Classwork/Homework: Change the type to “l” and report the plot type.
Graphical parameters passed to plot() apply only to that call. To keep a setting for the rest of the session, use the par() function. For example,
# Set the graphical parameter col through par() so that the color red is held
par(col="red")
# Plot LDL vs HDL
plot(HANES$LDL, HANES$HDL)
# Now make another plot:
# This time Hemoglobin vs HDL
plot(HANES$A1C, HANES$HDL)
Tip: As our commands become more and more complex and ask for more and more arguments, the specification of the dataset name again and again becomes onerous. To save some typing we have the function with(). Here is an example:
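# Set the graphical parameter col through par() so that the color red is held
par(col="red")
# Plot LDL vs HDL. Inside with() names are not partially matched,
# so the full column name LDLESTIMATE is needed; xlab must be passed to plot(), not with()
# (Hmisc::label() could supply these strings if labels have been set)
with(HANES, plot(LDLESTIMATE, HDL, xlab = "LDL estimate"))
# Now make another plot:
# This time Hemoglobin vs HDL
with(HANES, plot(A1C, HDL, ylab = "HDL"))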
More graphical parameters:
col.main: Color of the main title
cex.axis: Size of the axis numbers, relative to the default of 1 (values below 1 make them smaller). Just as the “col” parameter has variants such as “col.main”, “cex” also has variants; “cex.axis” is one of them.
lty: Specifies the line type - solid, dashed etc. (1 is a full line, 2 is dashed etc.)
pch: The style of the plotting symbol. Values 0 to 25 select built-in symbols, and a single character can be used as well.
So far we saw single plots of data, with no combinations and layers. It may be good to plot several. We can use “mfrow” with the par() function.
# Set the par function with mfrow to 2x2 "grid"
par(mfrow = c(2,2))
# Plot LDL vs HDL
plot(HANES$LDL, HANES$HDL)
# Plot A1C vs HDL
plot(HANES$A1C, HANES$HDL)
# Plot GLUCOSE vs HDL
plot(HANES$GLUCOSE, HANES$HDL)
# Plot CHOLESTEROLTOTAL vs HDL
plot(HANES$CHOLESTEROLTOTAL, HANES$HDL)
Classwork/Homework: Do the above exercise with “mfcol” argument. How does it plot?
To reset the plot to 1 figure, one can use par(mfrow = c(1,1)), that will get us back to normal.
The layout() function facilitates more complex plot arrangements.
# Create a grid on how our figures should appear
grid <- matrix(c(1,1,2,3), nrow=2, ncol=2, byrow=TRUE)
# Pass it to the layout function
layout(grid)
# Plot LDL vs HDL
plot(HANES$LDL, HANES$HDL)
# Plot GLUCOSE vs HDL
plot(HANES$GLUCOSE, HANES$HDL)
# Plot CHOLESTEROLTOTAL vs HDL
plot(HANES$CHOLESTEROLTOTAL, HANES$HDL)
# Reset the layout
layout(1)
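To see how the matrix passed to layout() maps to panels, it helps to print it and to let layout.show() outline the regions. A minimal sketch; the heights argument is optional and the values used here are just an example:

```r
# Each number names a figure; repeating a number makes that figure span cells
grid <- matrix(c(1, 1, 2, 3), nrow = 2, ncol = 2, byrow = TRUE)
grid

# layout() returns the number of figures; heights sets relative row heights
n_panels <- layout(grid, heights = c(2, 1))
n_panels

# Draw the outlines of the panels with their numbers
layout.show(n_panels)

# Reset to a single panel
layout(1)
```

Here figure 1 fills the whole top row, which is twice as tall as the bottom row holding figures 2 and 3.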
Tip: Resetting every time might be too tedious. A trick is to assign the old settings to an object and reuse it when necessary:
# Assign the old parameters to an object
# (no.readonly = TRUE skips read-only parameters, which cannot be restored)
old_parameters <- par(no.readonly = TRUE)
# Change to new parameters
par(col="red")
plot(HANES$LDL, HANES$HDL)
# Reset to old parameters
par(old_parameters)
# Test the original settings
plot(HANES$LDL, HANES$HDL)
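A compact variant of this trick relies on the fact that par() returns the previous values of exactly the parameters it changes. The sketch below uses simple integer sequences instead of HANES columns, and the name old is introduced here:

```r
# Change two parameters; `old` holds only their previous values
old <- par(col = "red", mfrow = c(1, 2))
names(old)

plot(1:10)
plot(10:1)

# Restore just what was changed
par(old)
par("mfrow")
```

Inside a function, the same idea is often combined with on.exit(par(old)) so that the settings are restored even if an error occurs.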
Stacking graphical elements is a great way of adding more information to the plots:
# Plot A1C vs GLUCOSESI
plot(HANES$A1C, HANES$GLUCOSESI, xlim=c(6,8), ylim=c(3,10))
# Fit a linear model of GLUCOSESI (the y-axis) on A1C (the x-axis), matching the plot.
# Note: `lm()` returns a fitted model object; `coef()` extracts its coefficients
lm_glucose_SI <- lm(GLUCOSESI ~ A1C, data = HANES)
# Stack the fitted line on top of the plot with line width 2 (specified by `lwd` argument)
abline(coef(lm_glucose_SI), lwd = 2)
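When stacking a fitted line on a scatter plot, the response in the formula should be the variable on the vertical axis. A minimal sketch with made-up, exactly linear data, so that the result is easy to check (x, y and fit are names introduced here):

```r
# Made-up data lying exactly on y = 2 + 3 * x
x <- c(1, 2, 3, 4, 5)
y <- 2 + 3 * x

# y ~ x: y is the response (vertical axis), x the predictor (horizontal axis)
fit <- lm(y ~ x)
coef(fit)

# Plot with the same orientation, then add the fitted line
plot(x, y)
abline(fit, lwd = 2, col = "blue")
```

abline() accepts the fitted model directly and reads the intercept and slope from it.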
Classwork/Homework: Make a plot and add elements through the functions points(), lines(), segments() and text().